Pseudotemporal ordering¶

Motivation¶

Single-cell sequencing assays provide high resolution measurements of biological tissues {cite}Islam2011, {cite}Hwang2018. Consequently, such technologies can help decipher and understand cellular heterogeneity {cite}Briggs2018, {cite}pt:Sikkema2022 and the dynamics of a biological process {cite}Jardine2021, {cite}He2022_fetal_lung.
Corresponding studies include quantifying cellular fates as well as identifying genes driving the process. However, as cells are destroyed when sequenced in classical single-cell RNA sequencing (scRNA-seq) protocols, it is impossible to track their development and, for example, gene expression profile over time. Although recent technological advances allow recording the transcriptome sequentially {cite}Chen2022, they are experimentally challenging and currently fail to scale to larger dataset. Consequently, the underlying dynamic process needs to be estimated from the measured snapshot data, instead.

Even though samples are, traditionally, taken from a single experimental time point, a multitude of cell types can be observed. This diversity stems from the asynchronous nature of biological processes. As such, a range of the developmental process can be observed. Reconstructing the developmental landscape is the goal of the field coined trajectory inference (TI). This task is achieved by ordering the observed cellular states according to the developmental process. States are aligned along the developmental direction by mapping discrete annotations to a continuous domain - the so-called pseudotime.

Pseudotimes rank cells relative to each other according to their respective stage in the developmental process. Less mature cells are assigned small, mature cells large values. Studying a bone marrow sample for example, hematopoietic stem cells are assigned a low, and erythroid cells a high pseudotime. The assignment is, in case of single-cell RNA sequencing data, based on the transcriptomic profile of a cell. Additionally, the construction usually requires the specification of an initial, or, equivalently, root cell where the overall process starts.

Pseudotime construction¶

Pseudotime construction generally follows a common workflow: As a first step, the ultra high-dimensional single-cell data is projected onto a lower dimensional representation. This procedure is justified by the observation that dynamical processes progress on a low-dimensional manifold {cite}pt:Wagner2016. In practice, pseudotime methods may rely on principal components (for example Palantir {cite}Setty2019) or diffusion components (for example diffusion pseudotime (DPT) {cite}Haghverdi2016). Following, pseudotimes are constructed based on one of the following principles.

  1. Observations are first clustered and, following, connections between these clusters identified. The clusters can be ordered and, thereby, a pseudotime constructed. Henceforth, we will refer to this approach as the cluster approach. Classical cluster algorithms include $k$-means {cite}Lloyd1982, {cite}MacQueen1967, Leiden {cite}pt:Traag2019, or hierarchical clustering {cite}Mueller2011. Clusters may be connected based on similarity, or by constructing a minimum spanning tree (MST) {cite}Pettie2002.

  2. The graph approach first finds connections between the lower dimensional representation of the observations. This procedure defines a graph based on which clusters, and thus an ordering, are defined. PAGA {cite}Wolf2019, for example, partitions the graph into Leiden clusters and estimates connections between them. Intuitively, this approach preserves the global topology of the data while analyzing it at a lower resolution. Consequently, the computational efficiency is increased.

  3. Manifold-learning based approaches proceed similar to the cluster approach. However, connections between clusters are defined by using principal curves or graphs to estimate the underlying trajectories. Principal curves find a one-dimensional curve that connects cellular observations in the higher dimensional space. A notable representation of this approach is Slingshot {cite}Street2018.

  4. Probabilistic frameworks assign transition probabilities to ordered cell-cell pairs. Each transition probability quantifies how likely the reference cell is the ancestor of the other cell. These probabilities define random processes that are used to define a pseudotime. DPT, for example, is defined as the difference between consecutive states of a random walk. Contrastingly, Palantir {cite}Setty2019 models trajectories themselves as Markov chains. While both approaches rely on a probabilistic framework, they require a root cell to be specified. The pseudotime itself is computed with respect to this cell.

TI is a well-studied field providing a rich set of methods. To apply the appropriate method to analyze a single-cell dataset, the biological process itself needs to be understood first. This understanding especially includes the nature of the process, i.e., if it, for example, is linear, cyclic, or branching. Similarly, orthogonal processes within one and the same dataset limits the TI methods applicable. To help identifying appropriate tools, dynguidelines {cite}Deconinck2021 provides an exhaustive overview of algorithms and their characteristics.

Down-stream tasks and outlook¶

Even though TI and pseudotime can already provide valuable insight, they usually act as a stepping stone for more fine grained analysis. Identifying terminal states, for example, is a classical biological question that can be studied. Similarly, lineage bifurcation and genes driving fate decisions can be identified based on TI and pseudotime. Which question can answer and how the answer is found is usually method specific. Palantir, for example, identifies terminal states as absorbing states of its constructed Markov chain.

The success of trajectory inference is well documented and, consequently, many methods have been proposed. However, with the advances of sequencing technologies, new sources of information become available. ATAC-seq {cite}Buenrostro2015, CITE-seq {cite}pt:Stoeckius2017, and DOGMA-seq {cite}pt:Mimitou2021, for example, measure additional modalities beyond the transcriptome. Lineage tracing {cite}Weinreb2020 and metabolic labeling {cite}Erhard2019, {cite}Battich2020, {cite}Qiu2020, {cite}Erhard2022 even provide the (likely) future state of a given cell. Consequently, future TI tools will be able to include more information to estimate trajectories and pseudotime more accurately and robustly, and allow answering novel questions. For example, RNA velocity {cite}LaManno2018, {cite}Bergen2020, {cite}Bergen2021 is one technique that uses unspliced and spliced mRNA to infer directed, dynamic information beyond classical, static snapshot data.

Inferring pseudotime for adult human bone marrow¶

To show how a pseudotime can be constructed and different pseudotimes compared, we study a dataset of adult human bone marrow {cite}Setty2019.

Environment setup¶

In [3]:
import numpy as np
import pandas as pd
import scanpy as sc
import matplotlib.pyplot as plt
from scipy.stats import spearmanr, fisher_exact
import statsmodels.api as sm

General settings¶

In [ ]:
#DATA_DIR = Path("../../data/")
#DATA_DIR.mkdir(parents=True, exist_ok=True)

#FILE_NAME = DATA_DIR / "SRR23934263_annotation.h5ad"

Data loading¶

In [4]:
file_path = "/home/jiawu2/pub_scRNAseq_run/SRR23934263_manual_annotated.h5ad"
adata = sc.read(file_path)

print(adata)
AnnData object with n_obs × n_vars = 4315 × 14901
    obs: 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_20_genes', 'total_counts_mt', 'log1p_total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'log1p_total_counts_ribo', 'pct_counts_ribo', 'total_counts_hb', 'log1p_total_counts_hb', 'pct_counts_hb', 'outlier', 'mt_outlier', 'scDblFinder_score', 'scDblFinder_class', 'size_factors', 'leiden', 'leiden_res0_25', 'leiden_res0_5', 'leiden_res1', 'manual_celltype_annotation'
    var: 'gene_ids', 'feature_types', 'genome', 'mt', 'ribo', 'hb', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts', 'n_cells', 'highly_deviant', 'binomial_deviance', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
    uns: 'dendrogram_leiden_res1', 'hvg', 'leiden', 'leiden_res0_25', 'leiden_res0_25_colors', 'leiden_res0_5', 'leiden_res0_5_colors', 'leiden_res1', 'leiden_res1_colors', 'manual_celltype_annotation_colors', 'neighbors', 'pca', 'rank_genes_groups', 'scDblFinder_class_colors', 'tsne', 'umap'
    obsm: 'X_pca', 'X_tsne', 'X_umap'
    varm: 'PCs'
    layers: 'analytic_pearson_residuals', 'counts', 'log1p_norm', 'scran_normalization', 'soupX_counts'
    obsp: 'connectivities', 'distances'

To construct pseudotimes, the data must be preprocessed. Here, we filter out genes expressed in only a few number of cells (here, at least 20). Notably, the construction of the pseudotime later on is robust to the exact choice of the threshold. Following to this first gene filtering, the cell size is normalized, and counts log1p transformed to reduce the effect of outliers. As usual, we also identify and annotate highly variable genes. Finally, a nearest neighbor graph is constructed based on which we will define the pseudotime. The number of principle components is chosen based on the explained variance.

In [5]:
# Use an existing normalized expression matrix
adata.X = adata.layers["log1p_norm"]  # or "scran_normalization"

The two-dimensional t-SNE or UMAP representation colored by cell type annotations shows that the cell types cluster together well. Additionally, the developmental hierarchy is visible.

In [6]:
#(Optional but recommended) sanity checks: confirm UMAP + neighbor graph exist
assert "connectivities" in adata.obsp, "Missing neighbor graph (adata.obsp['connectivities'])."
assert "X_umap" in adata.obsm, "Missing UMAP embedding (adata.obsm['X_umap'])."
assert "manual_celltype_annotation" in adata.obs, "Missing manual annotations in adata.obs."
In [7]:
#Compute diffusion map (uses the neighbor graph)
sc.tl.diffmap(adata)
In [8]:
#Pick a root cell (simple heuristic: extreme along diffusion component 3)
root_label = "norm HSPC (CD34+ GATA2+ NR1P1+)"
root_mask = adata.obs["manual_celltype_annotation"].astype(str) == root_label
n_root = int(root_mask.sum())
print(f"Root label: {root_label} | cells found: {n_root}")
if n_root == 0:
    raise ValueError(
        f"No cells matched root_label='{root_label}'. "
        "Check exact spelling in adata.obs['manual_celltype_annotation'].unique()."
    )
Root label: norm HSPC (CD34+ GATA2+ NR1P1+) | cells found: 297
In [9]:
# Pick a robust root cell: closest to the HSPC centroid in diffusion space
X_dm = adata.obsm["X_diffmap"]
centroid = X_dm[root_mask].mean(axis=0)
root_ix = int(np.linalg.norm(X_dm - centroid, axis=1).argmin())
adata.uns["iroot"] = root_ix
print("Root cell index (iroot):", root_ix)
Root cell index (iroot): 1454
In [10]:
# Compute DPT pseudotime
sc.tl.dpt(adata)
In [11]:
# Visualize: root cell + pseudotime on the EXISTING UMAP (3 rows)
adata.obs["is_root"] = False
adata.obs.iloc[root_ix, adata.obs.columns.get_loc("is_root")] = True

sc.pl.umap(
    adata,
    color=["manual_celltype_annotation", "is_root", "dpt_pseudotime"],
    ncols=1,                 # <- forces 3 rows
    frameon=False,
    legend_loc="right margin"  # better for long labels
)
No description has been provided for this image
In [12]:
# Optional: show pseudotime alongside clusters
sc.pl.umap(adata, color=["leiden_res1", "dpt_pseudotime"], ncols=2, frameon=False)
No description has been provided for this image
In [ ]:
 

Pseudotime construction¶

In [13]:
#Visualize pseudotime on diffusion map (this is what the tutorial’s “scatter” is good for)
sc.pl.scatter(
    adata,
    basis="diffmap",
    color="dpt_pseudotime",
    frameon=False
)
No description has been provided for this image
In [14]:
#Optional: also color by your annotation to interpret branches:
sc.pl.scatter(
    adata,
    basis="diffmap",
    color="manual_celltype_annotation",
    frameon=False,
    legend_loc="right margin",
)
No description has been provided for this image
In [15]:
genes_to_check = ["MPO", "CD34", "IL1RAP", "SPINK2", "GATA2", "HBB", "CD3D", "CD33", 
                  "CD47", "SPINK2", "CLL1", "CD8A", "CSF3R", "GATA2", "NR1P1"]
genes_to_check = [g for g in genes_to_check if g in adata.var_names]
genes_to_check
Out[15]:
['MPO',
 'CD34',
 'IL1RAP',
 'SPINK2',
 'GATA2',
 'HBB',
 'CD3D',
 'CD33',
 'CD47',
 'SPINK2',
 'CD8A',
 'CSF3R',
 'GATA2']
In [16]:
sc.pl.umap(adata, color=genes_to_check, ncols=4, frameon=False)
No description has been provided for this image

Visualize DPT on diffusion map space (tutorial-style, but useful)¶

In [17]:
#dpt pseudotime score - the higher the more distance from root
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 6))
sc.pl.scatter(
    adata,
    basis="diffmap",
    color="dpt_pseudotime",
    frameon=False
)
<Figure size 1000x600 with 0 Axes>
No description has been provided for this image
In [18]:
#Sanity-check pseudotime distribution by your cell types / clusters
import matplotlib.pyplot as plt

plt.figure(figsize=(14, 6))
ax = sc.pl.violin(
    adata,
    keys="dpt_pseudotime",
    groupby="manual_celltype_annotation",  # original labels
    rotation=90,
    stripplot=False,
    show=False          # <- important
)

# Explicitly remove any legend seaborn added
if ax.legend_ is not None:
    ax.legend_.remove()

plt.tight_layout()
plt.show()
/tmp/ipykernel_2753/3900764808.py:18: UserWarning: This figure includes Axes that are not compatible with tight_layout, so results might be incorrect.
  plt.tight_layout()
<Figure size 1400x600 with 0 Axes>
No description has been provided for this image
In [19]:
sc.pl.violin(
    adata,
    keys="dpt_pseudotime",
    groupby="leiden_res1",
    rotation=90
)
No description has been provided for this image
In [ ]:
adata.write("SRR23934263_dpt.h5ad")

Different pseudotime methods give different results. Sometimes, one pseudotime captures the underlying developmental processes more accurately. Here, we compare the just computed DPT with the pre-computed Palantir pseudotime (see here for the corresponding tutorial). One option to compare different pseudotimes is by coloring the low dimensional embedding of the data (here, t-SNE). Here, DPT is extremely high in the cluster of CLPs compared to all other cell types. Contrastingly, the Palantir pseudotime increases continuously with developmental maturity.

Malignant-only pseudotime analysis¶

Refined pseudotemporal ordering restricted to AML compartments

In [21]:
# subset AFTER all preprocessing is done
malignant_types = [
    "AML (CD34+ CD38low CD33hi IL1RAPlow CD47low)",
    "AML (CD34+ CD200+ FLT3+)",
    "AML (CD34+ Spink2low IL1RAPhi CD33low CLL1low CD47hi)",
    "G/M prog (MPO+ CSF3R+)",
    "Mono/Macro (CD68+ FCN1+ LYZ+ S100AB+ CLL1+ CD123+)",
    "AML? (CD34low CD38low KIT+ MPO+ TIM3+ CD123+)",
]
adata_m = adata[adata.obs["manual_celltype_annotation"].isin(malignant_types)].copy()
print("adata_m:", adata_m.shape)
adata_m: (1794, 14901)
In [22]:
#carry over log1p layer sanity check
layer = "log1p_norm"
assert layer in adata_m.layers.keys(), f"Layer '{layer}' not found. Available: {list(adata_m.layers.keys())}"
In [23]:
# Set root cell (blast-like AML) and run DPT
# =========================
root_label = "AML (CD34+ Spink2low IL1RAPhi CD33low CLL1low CD47hi)"
root_mask = (adata_m.obs["manual_celltype_annotation"] == root_label).values
root_ix = np.flatnonzero(root_mask)

if len(root_ix) == 0:
    raise ValueError(f"No cells found for root label: {root_label}")
adata_m.uns["iroot"] = int(root_ix[0])

# recompute PCA + neighbors + diffmap + DPT (must be done on subset)
sc.pp.pca(adata_m, n_comps=50)
sc.pp.neighbors(adata_m, n_neighbors=15, n_pcs=30)
sc.tl.diffmap(adata_m)
sc.tl.dpt(adata_m)

print("DPT done. dpt_pseudotime range:",
      float(adata_m.obs["dpt_pseudotime"].min()),
      float(adata_m.obs["dpt_pseudotime"].max()))

# UMAP for visualization (compute if not present)
if "X_umap" not in adata_m.obsm:
    sc.tl.umap(adata_m)

sc.pl.umap(adata_m, color=["dpt_pseudotime", "manual_celltype_annotation"], wspace=0.4)
/home/jiawu2/anaconda3/envs/scrna_env/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
  from .autonotebook import tqdm as notebook_tqdm
DPT done. dpt_pseudotime range: 0.0 1.0
No description has been provided for this image
In [24]:
# Extract CD33 / IL1RAP expression
# =========================
def get_gene_expr(adata_obj, gene, layer_name=None):
    if gene not in adata_obj.var_names:
        matches = [g for g in adata_obj.var_names if g.lower() == gene.lower()]
        if len(matches) == 1:
            gene = matches[0]
        else:
            raise KeyError(f"{gene} not found in var_names. Try searching with: "
                           f"[g for g in adata_obj.var_names if '{gene[:3].lower()}' in g.lower()][:20]")
    gidx = adata_obj.var_names.get_loc(gene)
    X = adata_obj.layers[layer_name] if layer_name else adata_obj.X
    col = X[:, gidx]
    if hasattr(col, "toarray"):
        col = col.toarray()
    return np.asarray(col).reshape(-1)

pt = adata_m.obs["dpt_pseudotime"].astype(float).values
cd33 = get_gene_expr(adata_m, "CD33", layer_name=layer)
il1  = get_gene_expr(adata_m, "IL1RAP", layer_name=layer)

df = pd.DataFrame({
    "pseudotime": pt,
    "CD33": cd33,
    "IL1RAP": il1,
    "celltype": adata_m.obs["manual_celltype_annotation"].astype(str).values
}, index=adata_m.obs_names)
In [25]:
# Sanity check: pseudotime ordering by manual state
# =========================
pt_summary = (
    df.groupby("celltype")["pseudotime"]
    .agg(["count","median","mean","min","max"])
    .sort_values("median")
)
display(pt_summary)
count median mean min max
celltype
AML (CD34+ Spink2low IL1RAPhi CD33low CLL1low CD47hi) 754 0.125293 0.139105 0.000000 0.397086
G/M prog (MPO+ CSF3R+) 176 0.365165 0.356834 0.120234 0.584604
AML (CD34+ CD38low CD33hi IL1RAPlow CD47low) 363 0.573439 0.571346 0.261397 0.865406
AML? (CD34low CD38low KIT+ MPO+ TIM3+ CD123+) 163 0.674177 0.631600 0.134151 0.883389
Mono/Macro (CD68+ FCN1+ LYZ+ S100AB+ CLL1+ CD123+) 338 0.753240 0.736004 0.137456 1.000000
In [27]:
# Quantify gene trends along pseudotime (Spearman + LOWESS plots)
# =========================
r_cd33, p_cd33 = spearmanr(df["pseudotime"], df["CD33"])
r_il1,  p_il1  = spearmanr(df["pseudotime"], df["IL1RAP"])
print(f"Spearman(pseudotime, CD33)  = {r_cd33:.3g}, p={p_cd33:.2e}")
print(f"Spearman(pseudotime, IL1RAP)= {r_il1:.3g}, p={p_il1:.2e}")

def plot_gene_vs_pt(y, gene_name, frac=0.25):
    plt.figure(figsize=(6,4.5))
    plt.scatter(df["pseudotime"], y, s=8, alpha=0.35)
    low = sm.nonparametric.lowess(y, df["pseudotime"], frac=frac, return_sorted=True)
    plt.plot(low[:,0], low[:,1], linewidth=2)
    plt.xlabel("DPT pseudotime (0→1)")
    plt.ylabel(f"{gene_name} ({layer})")
    plt.title(f"{gene_name} vs pseudotime (malignant subset)")
    plt.tight_layout()
    plt.show()

plot_gene_vs_pt(df["CD33"].values, "CD33")
plot_gene_vs_pt(df["IL1RAP"].values, "IL1RAP")
Spearman(pseudotime, CD33)  = 0.0258, p=2.74e-01
Spearman(pseudotime, IL1RAP)= -0.0723, p=2.18e-03
No description has been provided for this image
No description has been provided for this image
In [30]:
# Replot CD33 vs IL1RAP scatter colored by manual annotation
plt.figure(figsize=(5.5,5.5))

for g in df["celltype"].astype("category").cat.categories:
    m = (df["celltype"] == g).values
    plt.scatter(df.loc[m, "CD33"], df.loc[m, "IL1RAP"],
                s=10, alpha=0.5, label=g)

plt.xlabel(f"CD33 ({layer})")
plt.ylabel(f"IL1RAP ({layer})")
plt.title("IL1RAP vs CD33 (malignant subset; manual annotation)")

plt.legend(
    loc="upper center",
    bbox_to_anchor=(0.5, -0.15),
    ncol=2,
    frameon=False,
    fontsize=8
)

plt.tight_layout()
plt.show()
No description has been provided for this image
In [29]:
# Composition across pseudotime (10 bins)
# =========================
df["pt_bin"] = pd.qcut(df["pseudotime"], q=10, labels=[f"bin{i+1}" for i in range(10)])
comp = pd.crosstab(df["pt_bin"], df["celltype"], normalize="index").round(3)
display(comp)

comp.plot(kind="bar", stacked=True, figsize=(9,4))
plt.ylabel("Fraction of cells")
plt.title("Manual state composition across pseudotime (10 bins)")
plt.legend(bbox_to_anchor=(1.02,1), loc="upper left", frameon=False)
plt.tight_layout()
plt.show()
celltype AML (CD34+ CD38low CD33hi IL1RAPlow CD47low) AML (CD34+ Spink2low IL1RAPhi CD33low CLL1low CD47hi) AML? (CD34low CD38low KIT+ MPO+ TIM3+ CD123+) G/M prog (MPO+ CSF3R+) Mono/Macro (CD68+ FCN1+ LYZ+ S100AB+ CLL1+ CD123+)
pt_bin
bin1 0.000 1.000 0.000 0.000 0.000
bin2 0.000 1.000 0.000 0.000 0.000
bin3 0.000 0.916 0.011 0.067 0.006
bin4 0.000 0.856 0.028 0.117 0.000
bin5 0.162 0.408 0.028 0.391 0.011
bin6 0.430 0.022 0.128 0.335 0.084
bin7 0.617 0.000 0.089 0.072 0.222
bin8 0.469 0.000 0.229 0.000 0.302
bin9 0.257 0.000 0.257 0.000 0.486
bin10 0.089 0.000 0.139 0.000 0.772
No description has been provided for this image

the above bin plot tells the CD33 (Blue) comes late than IL1RAP (Red)¶

divided pseudotime (0 → 1) into ordered bins (early → late) and, for each bin plot in bargraph¶

In [ ]:
 
In [ ]:
 
In [ ]:
 
In [ ]:
 

Considering these observations and prior knowledge about the development in bone marrow, we would conclude to continue working with the Palantir pseudotime.

Key takeaways¶

  • Trajectory inference methods require the start of the biological process to be known (approximately).
  • The nature of the biological process defines which TI algorithms can be used. dynguidelines helps selecting the appropriate TI method.

References¶

{bibliography}
:filter: docname in docnames
:labelprefix: pt

Contributors¶

We gratefully acknowledge the contributions of:

Authors¶

  • Philipp Weiler

Reviewers¶

  • Lukas Heumos